
Make rule removal depend on gap in output #22

Merged

rikhuijzer merged 23 commits into main from rh/regression-part-6 on Jun 27, 2023

Conversation

rikhuijzer (Owner) commented Jun 22, 2023

This PR fixes multiple problems:

  • Increases the precision of the rank calculation to avoid removing the wrong rules.
  • Simplifies the calculation of the _feature_space. The old calculation was wrong in some cases (a test was added).
  • Sorts the rules by gap size before removal.
  • Improves docstrings.
  • Various refactorings.

Works towards #13. Maybe this PR already finishes #13 because the max_rules=10 scores are very close to the StableForestRegressor scores.
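The gap-size ordering mentioned above can be sketched as follows. This is a hypothetical Python illustration (SIRUS itself is Julia); the Rule fields and the definition of the gap as the absolute difference between the then- and otherwise-predictions are assumptions for this sketch, not the package's internals.

```python
# Hypothetical sketch of ordering rules by gap size before removal.
# The Rule type and the gap definition are assumptions, not SIRUS code.
from typing import NamedTuple

class Rule(NamedTuple):
    clause: str
    then: float       # prediction when the clause holds
    otherwise: float  # prediction when it does not

def gap(rule: Rule) -> float:
    # A rule whose two predictions differ a lot carries more information.
    return abs(rule.then - rule.otherwise)

def sort_by_gap(rules: list[Rule]) -> list[Rule]:
    # Largest gap first, so the least informative rules are removed first
    # when the list is truncated to max_rules.
    return sorted(rules, key=gap, reverse=True)

rules = [
    Rule("X[i, 1] < 32000.0", 0.061, 0.408),
    Rule("X[i, 2] >= 8000.0", 0.386, 0.062),
    Rule("X[i, 3] < 64.0", 0.20, 0.22),
]
ordered = sort_by_gap(rules)
print([r.clause for r in ordered])
# → ['X[i, 1] < 32000.0', 'X[i, 2] >= 8000.0', 'X[i, 3] < 64.0']
```

With this ordering, truncating to max_rules keeps the rules whose predictions differ most between the two sides of the split.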

Before

23×7 DataFrame
 Row │ Dataset          Model                   Hyperparameters                    nfolds  AUC     RMS     1.96*SE
     │ String           String                  String                             Int64   String  String  String
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────
   3 │ blobs            StableRulesClassifier   (n_trees = 50,)                        10  1.00            0.00
  ...
   7 │ titanic          StableRulesClassifier   (n_trees = 1500,)                      10  0.81            0.04
  ...
  11 │ haberman         StableRulesClassifier   (n_trees = 1500,)                      10  0.67            0.02
  ...
  14 │ make_regression  StableForestRegressor   (n_trees = 1500,)                      10          0.78    0.04
  15 │ make_regression  StableRulesRegressor    (n_trees = 1500, max_rules = 100)      10          0.26    0.08
  16 │ make_regression  StableRulesRegressor    (n_trees = 1500, max_rules = 30)       10          0.35    0.07
  17 │ make_regression  StableRulesRegressor    (n_trees = 1500, max_rules = 10)       10          0.40    0.04
  ...
  20 │ boston           StableForestRegressor   (n_trees = 1500,)                      10          0.67    0.09
  21 │ boston           StableRulesRegressor    (n_trees = 1500, max_rules = 100)      10          0.17    0.08
  22 │ boston           StableRulesRegressor    (n_trees = 1500, max_rules = 30)       10          0.23    0.09
  23 │ boston           StableRulesRegressor    (n_trees = 1500, max_rules = 10)       10          0.30    0.08

After

23×7 DataFrame
 Row │ Dataset          Model                   Hyperparameters                    nfolds  AUC     RMS     1.96*SE
     │ String           String                  String                             Int64   String  String  String
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────
  ..
   3 │ blobs            StableRulesClassifier   (n_trees = 50,)                        10  1.00            0.00
  ..
   7 │ titanic          StableRulesClassifier   (n_trees = 1500,)                      10  0.83            0.03
  ..
  11 │ haberman         StableRulesClassifier   (n_trees = 1500,)                      10  0.67            0.07
  ..
  14 │ make_regression  StableForestRegressor   (n_trees = 1500,)                      10          0.80    0.05
  15 │ make_regression  StableRulesRegressor    (n_trees = 1500, max_rules = 100)      10          0.54    0.09
  16 │ make_regression  StableRulesRegressor    (n_trees = 1500, max_rules = 30)       10          0.66    0.11
  17 │ make_regression  StableRulesRegressor    (n_trees = 1500, max_rules = 10)       10          0.70    0.08
  ..
  20 │ boston           StableForestRegressor   (n_trees = 1500,)                      10          0.67    0.09
  21 │ boston           StableRulesRegressor    (n_trees = 1500, max_rules = 100)      10          0.41    0.07
  22 │ boston           StableRulesRegressor    (n_trees = 1500, max_rules = 30)       10          0.57    0.08
  23 │ boston           StableRulesRegressor    (n_trees = 1500, max_rules = 10)       10          0.65    0.09

rikhuijzer commented Jun 22, 2023

I'm starting to have serious doubts about my src/dependent.jl implementation. For example, _unique_left_splits seems like too eager a simplification. Maybe I should rewrite the code after re-reading https://cs.stackexchange.com/questions/152803.

rikhuijzer commented

> I'm starting to have serious doubts about my src/dependent.jl implementation. For example, _unique_left_splits seems like too eager a simplification. Maybe I should rewrite the code after re-reading https://cs.stackexchange.com/questions/152803.

The task for tomorrow is then simple: fully work through an example based on the explanation by D.W. and put it in the Implementation Overview.

rikhuijzer commented

> I'm starting to have serious doubts about my src/dependent.jl implementation. For example, _unique_left_splits seems like too eager a simplification. Maybe I should rewrite the code after re-reading https://cs.stackexchange.com/questions/152803.
>
> The task for tomorrow is then simple: fully work through an example based on the explanation by D.W. and put it in the Implementation Overview.

Okay, so the difficulty seems to be that the simple approach of converting the rules to a binary feature space is not as easy as I thought. Linearly dependent rules are not guaranteed to show up. Maybe there is an algorithm to automatically find linear dependence while throwing away constraints, or otherwise I need to find the bug in my implementation of D.W.'s suggestion.
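The binary-feature-space idea can be made concrete with a minimal sketch. This is a Python illustration (the project itself is Julia), assuming single-split rules are encoded as indicator vectors over the cells of the joint feature space; a rule is dropped when its vector does not increase the rank of the vectors kept so far. The encoding and all names here are illustrative, not the SIRUS implementation.

```python
# Sketch: greedy filtering of linearly dependent rules via rank checks.
# Assumption: each single-split rule is encoded as a 0/1 vector over the
# cells of the joint feature space (here, 2 features x 2 regions = 4 cells).

def rank(rows: list[list[float]], tol: float = 1e-9) -> int:
    # Plain Gaussian elimination with partial pivoting; counts pivots
    # whose magnitude exceeds the tolerance.
    m = [row[:] for row in rows]
    r = 0
    for c in range(len(m[0]) if m else 0):
        pivot = max(range(r, len(m)), key=lambda i: abs(m[i][c]), default=None)
        if pivot is None or abs(m[pivot][c]) <= tol:
            continue  # no usable pivot in this column
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(len(m)):
            if i != r and abs(m[i][c]) > tol:
                f = m[i][c] / m[r][c]
                m[i] = [a - f * b for a, b in zip(m[i], m[r])]
        r += 1
        if r == len(m):
            break
    return r

def filter_dependent(vectors: list[list[float]]) -> list[int]:
    # Keep the index of a vector only if it raises the rank of the set
    # kept so far; otherwise it is a linear combination of earlier rules.
    kept: list[list[float]] = []
    keep_idx: list[int] = []
    for i, v in enumerate(vectors):
        if rank(kept + [v]) > rank(kept):
            kept.append(v)
            keep_idx.append(i)
    return keep_idx

# Cells: (X1 low, X2 low), (X1 low, X2 high), (X1 high, X2 low), (X1 high, X2 high)
r1 = [1.0, 1.0, 0.0, 0.0]  # X1 < 32000: indicator of the two "X1 low" cells
r2 = [0.0, 0.0, 1.0, 1.0]  # X1 ≥ 32000: complement of r1
r4 = [0.0, 1.0, 0.0, 1.0]  # X2 ≥ 8000: cuts across both X1 regions
vectors = [r1, r2, r1, r2, r4]
print(filter_dependent(vectors))  # → [0, 1, 4]
```

Note that r4 is independent of r1 and r2 (no combination a·r1 + b·r2 produces [0, 1, 0, 1]), so it must survive the filter; the duplicated copies of r1 and r2 are correctly dropped.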

rikhuijzer commented Jun 25, 2023

Improved accuracy slightly after ordering the rules by gap size in 67e3ec6.

Before

23×7 DataFrame
 Row │ Dataset          Model                   Hyperparameters                    nfolds  AUC     RMS     1.96*SE
     │ String           String                  String                             Int64   String  String  String
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────
   3 │ blobs            StableRulesClassifier   (n_trees = 50,)                        10  1.00            0.00
  ...
   7 │ titanic          StableRulesClassifier   (n_trees = 1500,)                      10  0.81            0.04
  ...
  11 │ haberman         StableRulesClassifier   (n_trees = 1500,)                      10  0.67            0.02
  ...
  15 │ make_regression  StableRulesRegressor    (n_trees = 1500, max_rules = 100)      10          0.26    0.08
  16 │ make_regression  StableRulesRegressor    (n_trees = 1500, max_rules = 30)       10          0.35    0.07
  17 │ make_regression  StableRulesRegressor    (n_trees = 1500, max_rules = 10)       10          0.40    0.04
  ...
  20 │ boston           StableForestRegressor   (n_trees = 1500,)                      10          0.67    0.09
  21 │ boston           StableRulesRegressor    (n_trees = 1500, max_rules = 100)      10          0.17    0.08
  22 │ boston           StableRulesRegressor    (n_trees = 1500, max_rules = 30)       10          0.23    0.09
  23 │ boston           StableRulesRegressor    (n_trees = 1500, max_rules = 10)       10          0.30    0.08

After

23×7 DataFrame
 Row │ Dataset          Model                   Hyperparameters                    nfolds  AUC     RMS     1.96*SE
     │ String           String                  String                             Int64   String  String  String
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────
  ...
   3 │ blobs            StableRulesClassifier   (n_trees = 50,)                        10  1.00            0.00
  ...
   7 │ titanic          StableRulesClassifier   (n_trees = 1500,)                      10  0.80            0.04
  ...
  11 │ haberman         StableRulesClassifier   (n_trees = 1500,)                      10  0.68            0.07
  ...
  15 │ make_regression  StableRulesRegressor    (n_trees = 1500, max_rules = 100)      10          0.48    0.07
  16 │ make_regression  StableRulesRegressor    (n_trees = 1500, max_rules = 30)       10          0.52    0.11
  17 │ make_regression  StableRulesRegressor    (n_trees = 1500, max_rules = 10)       10          0.55    0.06
  ...
  21 │ boston           StableRulesRegressor    (n_trees = 1500, max_rules = 100)      10          0.34    0.07
  22 │ boston           StableRulesRegressor    (n_trees = 1500, max_rules = 30)       10          0.35    0.07
  23 │ boston           StableRulesRegressor    (n_trees = 1500, max_rules = 10)       10          0.35    0.10

rikhuijzer commented

It looks like improving the rule post-processing step really does improve regression performance. Probably something is still wrong, which would explain why performance remains so poor. This also explains why I earlier noticed that the rule extraction method didn't affect outcomes much: it now looks like it does for regression, but not for classification.

rikhuijzer commented Jun 26, 2023

After 65907ae:

23×7 DataFrame
 Row │ Dataset          Model                   Hyperparameters                    nfolds  AUC     RMS     1.96*SE
     │ String           String                  String                             Int64   String  String  String
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────
  ...
   3 │ blobs            StableRulesClassifier   (n_trees = 50,)                        10  1.00            0.00
  ...
   7 │ titanic          StableRulesClassifier   (n_trees = 1500,)                      10  0.84            0.03
  ...
  11 │ haberman         StableRulesClassifier   (n_trees = 1500,)                      10  0.67            0.06
  ...
  15 │ make_regression  StableRulesRegressor    (n_trees = 1500, max_rules = 100)      10          0.64    0.08
  16 │ make_regression  StableRulesRegressor    (n_trees = 1500, max_rules = 30)       10          0.57    0.08
  17 │ make_regression  StableRulesRegressor    (n_trees = 1500, max_rules = 10)       10          0.66    0.07
  ...
  21 │ boston           StableRulesRegressor    (n_trees = 1500, max_rules = 100)      10          0.43    0.06
  22 │ boston           StableRulesRegressor    (n_trees = 1500, max_rules = 30)       10          0.53    0.08
  23 │ boston           StableRulesRegressor    (n_trees = 1500, max_rules = 10)       10          0.55    0.09

rikhuijzer commented Jun 26, 2023

When the single rules are not simplified, _filter_linearly_dependent might remove too many rules, as can be seen via

@test length(S._process_rules(repeat(allrules, 34), algo, 9)) == 9

which fails with 8 == 9. So there is still something wrong with the filter.

rikhuijzer commented Jun 27, 2023

Localized the following bug in eb74b60. It happens only when the number of repeats is greater than or equal to 34:

julia> r1
SIRUS.Rule(TreePath(" X[i, 1] < 32000.0 "), [0.061], [0.408])

julia> r2
SIRUS.Rule(TreePath(" X[i, 1] ≥ 32000.0 "), [0.408], [0.061])

julia> r4
SIRUS.Rule(TreePath(" X[i, 2] ≥ 8000.0 "), [0.386], [0.062])

julia> dependent = S._linearly_dependent([repeat([r2, r1], 34); r4], A, B)
69-element BitVector:
 0
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 ⋮
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 0
 1
 1

Here, r4 should definitely not be considered linearly dependent. However, for some reason, the zero (0) appears not at the last index but a few indices before it.

EDIT: Fixed by using rank(A; atol=1e-6).
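Why an absolute tolerance fixes this class of bug can be shown with a minimal Python sketch (the project itself is Julia, where the fix is rank(A; atol=1e-6)). Floating-point elimination leaves tiny nonzero residues, and an exact rank computation counts them as pivots, which misclassifies a rule's dependence. The rank helper below is a plain Gaussian-elimination illustration, not the SIRUS code.

```python
# Sketch: numerical rank with and without an absolute tolerance.
# A "zero up to rounding noise" entry flips the computed rank.

def rank(rows: list[list[float]], tol: float) -> int:
    # Gaussian elimination with partial pivoting; a pivot counts only if
    # its magnitude exceeds tol.
    m = [row[:] for row in rows]
    r = 0
    for c in range(len(m[0]) if m else 0):
        pivot = max(range(r, len(m)), key=lambda i: abs(m[i][c]), default=None)
        if pivot is None or abs(m[pivot][c]) <= tol:
            continue  # column has no usable pivot
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(len(m)):
            if i != r and abs(m[i][c]) > tol:
                f = m[i][c] / m[r][c]
                m[i] = [a - f * b for a, b in zip(m[i], m[r])]
        r += 1
        if r == len(m):
            break
    return r

# 1e-12 stands in for a residue left behind by earlier elimination steps.
noisy = [[1e-12, 0.0], [0.0, 1.0]]
print(rank(noisy, tol=0.0))   # → 2: the residue is counted as a pivot
print(rank(noisy, tol=1e-6))  # → 1: the residue is treated as zero
```

With tol=0 the matrix looks full rank, so a genuinely dependent rule appears independent (or vice versa); an absolute tolerance like 1e-6 treats near-zero entries as exact zeros, matching the intent of the dependence check.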

rikhuijzer enabled auto-merge (squash) on June 27, 2023 09:06
rikhuijzer merged commit c608c4e into main on Jun 27, 2023
4 checks passed
rikhuijzer deleted the rh/regression-part-6 branch on June 27, 2023 09:16